1. Understanding Full-Text Search
Imagine that you’re implementing a search engine for your website.
You could write some simple database queries to search your content. You
might use the LIKE operator in SQL to
do simple pattern matching against the search queries. However, you’ll
soon find out that this breaks down even in simple searches.
The terms users search for may not appear together. Users may
search for variants of the term taxes where your
database has only tax, or
players where your database has only
player. Since everyone is familiar with Google,
your users will expect these search queries to work and to “do the right
thing.” This is way beyond what you can do with a few simple SQL
queries.
Performance would be terrible, since the database would need to
look through every single row to find the data you want. Unlike a
numeric column, you could not create an index on which the execution
engine could perform a binary search to find the right row to go
to.
Though you cannot build Google, you can provide rudimentary
full-text search capability. This is available out of the box with most
modern RDBMSs—SQL Server has had it for a couple of versions, and MySQL
added it in version 3 and enhanced it considerably in later versions.
This feature examines all the words in every stored document (where
“document” refers to any content you stick in the database), and
attempts to match that with the user’s queries.
Full-text search (FTS) engines are smart enough to
recognize different versions of the same word. For example, searching
for any of the words driven,
drove, or drives should bring
up drive as a result. FTS engines also know to
detect similar phrases and perform rudimentary Boolean logic where you
can search for the existence of multiple terms. They also typically
contain rudimentary ranking algorithms to rank the results they
find.
Another popular option is to use open source FTS projects such as
Lucene that use the filesystem as a store. However, these don’t
typically work on Windows Azure, or don’t fit into the stateless
frontend model that the cloud demands, since they use the filesystem as a backend store.
Note: Although there are ways to change the backend store of these
projects and make them store data elsewhere, they’re sparsely used at
this point, and in general, they don’t work as well as using the
filesystem. It would be quite difficult to optimize for Azure’s Table
service, as we will explore throughout this discussion.
2. Indexing
Now let’s get to the fun part: what makes these FTS engines tick. No
discussion on FTS engines can get far without a discussion on
indexing—or, to be specific, a discussion on indexes.
You’re probably familiar with the role of an index in a book. If
you turn to the back of this book, you’ll find one. If you look up the
term full-text search, it should refer you back to
this chapter. Now, try to imagine a book without an index. Finding a
specific term in such a book would be a painful task. You would need to
turn page by page, skim through the text, and try to guess in which
chapter or section it appears. You may succeed quickly in a slim book,
but imagine a book with several hundred pages, or even thousands of
pages.
Searching through a database poses the same challenge, but on a
much larger scale. Instead of having to read through thousands of pages,
the computer must look through the equivalent of millions of pages (if
not more). Even though databases are fast at looking through things,
this is still too slow a process, and entails too much data to walk
through sequentially.
The fix is simple. Create an index of the data in the database,
just like the index at the back of this book, and have the computer look
up the term in the index to find out where the data resides. This raises
a question: what about the time it takes to look up the term in an
index?
Think for a second about how you look up terms in a book’s index.
Since the index is alphabetically sorted, you know whether to go
backward or forward, and how many pages to skip. Similarly, the index in
FTS engines is stored in a sorted order. The FTS engine can perform a
binary search through the index to quickly arrive at the right
term.
At this point, if you were using a typical RDBMS, you probably
know as much about FTS indexes as you need to. In theory, you could just
ask the database to start indexing specific columns in your tables.
However, as mentioned earlier, Azure storage does not come with this out
of the box. That means it is up to developers to do the heavy lifting to
build these indexes and write the code to look through them.
You will have help when doing this. You can use Azure storage to
store the actual data, and use Azure tables to query the index. This
enables you to index and query over really large datasets without having
to worry about memory or storage capacity issues.
Note: Will Azure storage have FTS sometime in the future? At the
moment, there have been no announcements from Microsoft putting FTS on
the official road map. However, the Azure team knows this is something
a lot of customers want, and is actively investigating adding this in
a future release. If you’re reading this book a few years after
publication, chances are this feature may be part of the feature set,
and you can skip over this entire section.
2.1. Documents and terms
Let’s become familiar with two terms that are common in the FTS
world:
Document
A document is the unit of content that is
returned in your search results. For web search engines such as
Google and Bing, a document is a web page. If you’re building an
email search, a document is an email. You can think of it as the
thing that shows up in the query results. Documents can be of
any granularity. If you were indexing books, you could make an
entire book one big document. Or you could drill down and make
each chapter, or each section, a document to enable you to get
as close as possible to the exact location of the search term.
The index at the back of this book uses individual pages as its
“documents,” since the “result” is a page number you can turn
to.
Term
A document is made up of several terms. Typically, for textual content,
every term can be roughly approximated to a word—the previous
sentence can be said to have eight terms. Note the word
approximated, because you might be
transforming the term before indexing to help with your search
results.
Now let’s take a look at transforming.
2.2. Case folding and stemming
Following is a famous rhyme from the book The Lord of
the Rings by J.R.R. Tolkien (George Allen and Unwin). This
will be used for the next few examples.
Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in their halls of stone,
Nine for Mortal Men doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind
them
In the Land of Mordor where the Shadows lie.
Let’s assume for a moment that each word in this poem is a term,
and each line is a document in itself. If a user searches for “one”
(note the lack of capitalization), you would want to show all the
lines starting with the text “One Ring.” To put it differently, users
shouldn’t need to care about the case of the queries they’re typing
in.
To support this, you must transform your documents by
case-folding into a standard case, either all
uppercase or all lowercase. This makes life much easier when searching
for this text, since you don’t have to worry about the case of the
actual content. For example, you would convert the first line in the
poem into “three rings for the elven-kings under the sky.”
Variants on terms are a trickier issue. Look at the third line
in the poem. If someone searched for the term
doom, you should be able to recognize that
doomed is a variant of doom.
This is almost impossible to do at query time.
The right thing to do is to convert every word into its root
form through a process called stemming. During indexing, you convert every
word into a stemmed version. For example, instead of storing
doomed, you would store the word
doom. Plurals, tenses, and other word variants
will be converted into one standard form. At query time, you convert
the terms in the query into their stemmed version as well.
There are several well-known stemming algorithms with public,
open source libraries. Throughout this chapter, you will use a simple,
well-known implementation called the Porter Stemming
Algorithm. You can find implementations of Porter’s
algorithm for various languages (as well as information on the
algorithm itself) at http://tartarus.org/~martin/PorterStemmer/.
Note: Big search engines such as Google and Yahoo! use complex
stemming and case-folding algorithms that go beyond the simple
process just described. For example, the preceding algorithms can’t
deal with international input—most open source stemming libraries
work only with English text. Even something as simple as case
folding becomes tricky with some languages in which “cases” don’t
work in the same way they do in the English language. If you’re
interested in exploring these topics in detail (and exploring
searching in general) you should pick up any text on information
retrieval and start digging. It is a vast and rich field.
2.3. Inverted indexes
The correct term for the kind of index you’ll be building
here is an inverted index (or
reverse index). This is not a novel computer science data structure.
Religious scholars have used them for centuries, and important works
have had concordances created for them manually. (A
concordance is an alphabetical listing of key
terms in a work, with the identification or citation of the passages
in which they occur.)
Note: One particular famous (but unintended) use of concordances
involved the Dead Sea Scrolls, the only known surviving copies of
Biblical documents made before 100 A.D. After the discovery of the
Dead Sea Scrolls in the 1940s, the team that discovered and
excavated the caves refused to publish most of the scrolls, but
published a concordance that contained a mapping of terms to the
scroll in which they appear. In 1991, a Cincinnati student entered
the concordance into his computer and reconstructed the original
underlying text. This caused the immediate release of the facsimile
edition of the originals.
The best way to describe these indexes is with an example. They
are typically made up of two data structures. One data structure
contains a mapping of document IDs to the document’s contents. Using
the earlier poem from The Lord of the Rings, this
mapping would look like Table 1
if you assume every line to be a document by itself.
Table 1. Document ID to document mapping
Document
ID | Document |
---|
0 | Three Rings for the
Elven-kings under the sky, |
1 | Seven for the
Dwarf-lords in their halls of stone, |
2 | Nine for Mortal Men
doomed to die, |
3 | One for the Dark Lord
on his dark throne |
4 | In the Land of Mordor
where the Shadows lie. |
5 | One Ring to rule them
all, One Ring to find them, |
6 | One Ring to bring them
all and in the darkness bind them |
7 | In the Land of Mordor
where the Shadows lie. |
An inverted index contains a list of pointers of terms to the
documents in which they appear. The list of terms is exhaustive—if you
add up all the terms in the inverted index, you will have a lexicon of
all terms that appear in all the documents that were indexed. Table 2 shows a snippet of an inverted index
constructed from Table 1.
(The full inverted index contains more than 45 terms.)
Table 2. Inverted index
Term | Document
IDs |
---|
Three | 0 |
Rings | 0 |
for | 0, 1, 2,
3 |
the | 0, 0, 1, 3, 4, 4, 6, 7,
7 |
Elven-kings | 0 |
under | 0 |
sky, | 0 |
Seven | 1 |
Dwarf-lords | 1 |
Mordor | 4, 7 |
Ring | 5, 5, 6 |
Looking at Table 2, a few things are
apparent. First, you can see that a few common words dominate in
occurrence (in this case, the word the). These
are called stop words, and are usually eliminated from
queries, since they occur so frequently and tend to pull in more
search results than necessary.
You also see that two terms are very similar to each other:
Ring and Rings. These appear
since there was no stemming in this table. In the code samples
provided in this chapter, you’ll be stemming all terms to collapse
every word into its root form.
At this point, you might have figured out how searching works. A
query for a single term means a single look-up of the inverted index
table, and a retrieval of the list of documents. A query that involves
multiple terms (in which the user is looking for documents where all
the terms appear) is performed by doing an intersection of the result sets for
each term.
For example, consider the sample search “Rings the Elven-kings.”
Using Table 11-2, you know that the term
Rings appears in document 0,
Elven-kings appears in document 0, and
the appears in quite a few documents, including
document 0. The intersection of all of these is just one document
(line)—0—the very first line of the poem.
Let’s work through another sample query: “Ring Mordor.” In this
case, Ring maps to documents 5, 5, and 6, while
Mordor maps to 4 and 7. Since there is no
overlap, you have no results to display for this query.
This examination actually sidesteps a big part of showing search
results: ranking. A lot of the “secret sauce” in
large search engines such as Google is in ranking results. However,
most users of FTS use it to display search results from a single
website, or an otherwise limited source of data. Ranking results is
difficult in such cases and algorithms to do so vary wildly—from
examining where the search term figures in the text, to looking at
page traffic for all the result pages. This also reflects the fact
that ranking in FTS engines today is a bit of an unsolved
problem.
Thus, this examination does not discuss ranking results. Once
you have the list of results, let your imagination run wild on how to
rank them.